An In-Depth Analysis of the Social Media Platform Trell#


Introduction#

Welcome to our data story project, where we embark on an exciting journey to explore the depths of the social media platform Trell. Through a series of interlinked visualizations and explanatory text, we aim to unravel the intricate relationships within Trell’s user data and shed light on the factors influencing user engagement.

Trell, a popular social media platform, offers users a unique space to discover, create, and share their experiences through captivating visual content. In this project, we dive into a comprehensive dataset that encompasses a wide range of attributes related to Trell’s users. From user demographics and activity patterns to engagement metrics and content preferences, our dataset provides a rich foundation for uncovering fascinating insights.

Before we delve into the analysis, we diligently preprocess the dataset to ensure data quality and relevance. Cleaning the dataset, handling missing values, and transforming variables where necessary form the crucial groundwork for our exploration. By employing best practices in data preprocessing, we ensure that our subsequent analyses and visualizations are accurate and informative.

Throughout the project, we actively seek feedback from our Teaching Assistant (TA) and peers, recognizing the value of diverse perspectives in refining our analysis and improving the clarity of our visualizations. This iterative process enables us to present a compelling data story that effectively communicates the insights derived from the Trell dataset.

Join us on this captivating journey as we uncover the correlations between various attributes within Trell and unravel the secrets behind user engagement patterns. Through the fusion of data, visualizations, and explanatory text, we hope to empower researchers, marketers, and enthusiasts with a deeper understanding of the dynamic landscape of Trell.

Our perspectives#

Perspective 1: Trell’s Perspective#

From Trell’s standpoint, understanding user behavior and preferences is crucial for improving the platform’s functionality and enhancing user satisfaction. Through this perspective, we delve into the dataset to uncover valuable insights that can inform strategic decisions and shape Trell’s future development.

We examine the correlations between attributes such as user demographics, content viewing patterns, and engagement metrics to gain a comprehensive understanding of Trell’s user base. By analyzing trends related to weekends vs. weekdays, timeslots of content consumption, and the impact of hashtags and emojis on user interactions, we aim to provide Trell with valuable insights to optimize user experience and drive platform growth.

Argument #1: The best content is (probably) created by males over 30.

  • Figure 3: Males over 30 create the most content of all age and gender groups. Because they create the most content, there is a good chance that many of Trell’s best creators are males over 30. A viewer can therefore (probably) expect the best-quality videos when watching content created by men over 30.

Perspective 2: Content Creator on Trell#

As a content creator on Trell, you play a vital role in shaping the platform’s landscape and engaging with its user base. Through this perspective, we aim to provide insights into the factors that contribute to your success and help you optimize your content creation strategy.

By analyzing the dataset, we explore the correlation between various attributes and the content creator’s performance on Trell. We investigate factors such as following rate, average age of followers, and repetitive punctuation usage to understand their impact on content reach and engagement. Through visualizations and data-driven analysis, we aim to empower content creators with actionable insights to enhance their content’s visibility and impact.

Argument #1: The best time to upload is between 18:00-00:00.

  • Figure 1: This chart shows that the 18:00-00:00 slot accounts for the greatest number of videos watched by Trell users of all four 6-hour intervals in a day. This means that during that interval, the greatest number of users is active on Trell. If you want your videos to be watched as much as possible, you should upload at the time most users are online.

Argument #2: The best age and gender group to focus on is girls under 18.

  • Figure 2, 4: From Figure 2, we can see that people under 18 watch by far the most videos on Trell. From Figure 4, we can see that females spend the most time on the app, because the first quartile, median, third quartile, and upper fence of the female boxplot are higher than those of the male boxplot. If you want your videos to be watched as much as possible, you should target girls under 18.

Perspective 3: Viewer on Trell#

As a viewer on Trell, you are an integral part of the platform’s ecosystem, consuming and engaging with the captivating content created by its users. Through this perspective, we aim to uncover insights that enhance your viewing experience and provide a deeper understanding of the content you encounter on Trell.

By analyzing the dataset, we explore the correlations between user attributes and viewing patterns, seeking to understand the factors that drive your engagement and preferences on Trell. We examine variables such as content duration, completion rates, and the impact of comments on content relevance to uncover trends that shape your viewing habits.

Argument #1: Trell needs more staff throughout the day for the best service.

  • Figure 1: This chart shows that each time slot has more videos watched than the one before it. This means that more users become active on Trell as the day progresses (assuming the time watched per video stays the same), so in general the app runs into progressively more difficulties throughout the day. For this reason, Trell needs more staff available for troubleshooting as the day goes on.

Argument #2: It’s smart to target different age and gender groups with different ads of Trell.

  • Figure 2, 3: These figures show the difference between the people who upload the most and the people who view the most. From Figure 2, we can see that especially younger people watch videos. From Figure 3, we can see that especially older people, and men in particular, create videos. So the age groups that watch videos and those that upload them differ considerably, which means it is smart to show ads about the possibilities as a viewer to younger people, and ads about the possibilities as a content creator to older people.

Dataset and preprocessing#

Our dataset ‘train_age_dataset.csv’ can be found at: https://www.kaggle.com/datasets/adityak80/trell-social-media-usage-data?resource=download&select=train_age_dataset.csv. It can be used to find correlations between user attributes, such as how many videos users watch or how long they look at a certain post on average. The only form of preprocessing we really used was Tukey’s fences: with the standard k value of 1.5 we sorted out outliers, as we wanted to filter out bots with, for example, an average watch time of 6 million seconds per video.
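The filtering step is simple enough to sketch. Assuming a generic numeric column, Tukey’s fences with k = 1.5 keep only values inside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; the values below are made up purely to illustrate the bot case:

```python
import pandas as pd

# Toy watch-time values (seconds per video); 6_000_000 mimics a bot account
s = pd.Series([30, 45, 50, 55, 60, 70, 6_000_000])

# Tukey's fences with the standard k = 1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; the bot-like value is dropped
filtered = s[(s >= lower) & (s <= upper)]
```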

# Imports

import pandas as pd
from scipy.stats import pearsonr
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as pyo
import numpy as np
pyo.init_notebook_mode()
# Calculate all possible Pearson's R

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('train_age_dataset.csv')
list_corr = []

for column in df.columns:
    for target_column in df.columns:
        if column != target_column:
            df_cleaned = df.dropna(subset=[column, target_column])

            # Extract the two attributes as separate Series from the DataFrame
            x = df_cleaned[column]
            y = df_cleaned[target_column]
            # Calculate Pearson's correlation coefficient and p-value
            corr, p_value = pearsonr(x, y)

            # Print the correlation coefficient
            list_corr.append([corr, column, target_column])
            #print("Pearson's correlation coefficient:", corr)

# Sort by correlation coefficient; each symmetric pair (x, y)/(y, x) ends up
# adjacent, so keeping every other entry removes the duplicates
list_corr.sort()

list_corr = list_corr[::2]

# Show the ten strongest positive correlations
print(list_corr[-10:])
[[0.7359130644246615, 'slot2_trails_watched_per_day', 'weekdays_trails_watched_per_day'], [0.7472930535908913, 'content_views', 'slot2_trails_watched_per_day'], [0.7619040569337124, 'avgComments', 'num_of_comments'], [0.7766766416363531, 'content_views', 'weekends_trails_watched_per_day'], [0.7896744553131418, 'slot3_trails_watched_per_day', 'weekdays_trails_watched_per_day'], [0.7943382967746, 'slot4_trails_watched_per_day', 'weekdays_trails_watched_per_day'], [0.7951807026891985, 'content_views', 'slot4_trails_watched_per_day'], [0.7958924268797808, 'content_views', 'slot3_trails_watched_per_day'], [0.9275480634476255, 'content_views', 'weekdays_trails_watched_per_day'], [0.9396388917332891, 'followers_avg_age', 'following_avg_age']]
# Graph 1

data = pd.read_csv('./train_age_dataset.csv')

slot1_avg = data['slot1_trails_watched_per_day'].mean()
slot2_avg = data['slot2_trails_watched_per_day'].mean()
slot3_avg = data['slot3_trails_watched_per_day'].mean()
slot4_avg = data['slot4_trails_watched_per_day'].mean()

slot_averages = [slot1_avg, slot2_avg, slot3_avg, slot4_avg]

slots = ['00:00-05:59', '06:00-11:59', '12:00-17:59', '18:00-23:59']

fig = go.Figure(data=[go.Pie(labels=slots, values=slot_averages)])

fig.update_layout(
    title='Average Videos Watched per Time Slot',
    height=500
)

fig.show()
# Graph 2

# Read the data from CSV
data = pd.read_csv('train_age_dataset.csv')

# Map the age group values to the corresponding labels
age_labels = {
    1: '<18y',
    2: '18-24y',
    3: '24-30y',
    4: '>30y'
}

data['age_group'] = data['age_group'].map(age_labels)
data['age_group'] = pd.Categorical(data['age_group'], categories=age_labels.values(), ordered=True)

# Group the data by age group and calculate the mean of videos watched
grouped_data = data.groupby('age_group')['content_views'].mean().reset_index()

# Sort the grouped data by age group
grouped_data = grouped_data.sort_values('age_group')

# Create lists for age groups and total videos watched
age_groups = grouped_data['age_group'].tolist()
total_videos_watched = grouped_data['content_views'].tolist()

# Create the Plotly bar chart
fig = go.Figure(data=[go.Bar(x=age_groups, y=total_videos_watched)])

# Update the layout
fig.update_layout(
    xaxis_title='Age Group',
    yaxis_title='Average Videos Watched',
    title='Average Videos Watched per Age Group per Person',
    height=500
)

# Display the plot
fig.show()
# Graph 3

data = pd.read_csv('train_age_dataset.csv')

# Define the age group labels
age_labels = {
    1: '<18y',
    2: '18-24y',
    3: '24-30y',
    4: '>30y'
}

# Map the age group labels to the age_group column
data['age_group'] = data['age_group'].map(age_labels)
data['age_group'] = pd.Categorical(data['age_group'], categories=age_labels.values(), ordered=True)

# Group the data by age group and gender and calculate the average videos uploaded per person
grouped_data = data.groupby(['age_group', 'gender'])['creations'].mean().reset_index()

# Separate data for each gender
male_data = grouped_data[grouped_data['gender'] == 1]
female_data = grouped_data[grouped_data['gender'] == 2]

# Create bar traces for male and female genders
male_trace = go.Bar(
    x=male_data['age_group'],
    y=male_data['creations'],
    name='Male',
    visible=True  # Male trace is visible initially
)
female_trace = go.Bar(
    x=female_data['age_group'],
    y=female_data['creations'],
    name='Female',
    visible=False,  # Hidden here; toggled visible below for the 'Both' view
    marker=dict(color='red')
)

# Create the layout
layout = go.Layout(
    title='Average Videos Uploaded per Person by Gender and Age Group',
    xaxis=dict(title='Age Group'),
    yaxis=dict(title='Average Videos Uploaded'),
    height=500
)

# Create the figure and add the traces
fig = go.Figure(data=[male_trace, female_trace], layout=layout)

# Create dropdown menu buttons
buttons = [
    dict(
        args=[
            {'visible': [True, True]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show both traces
        label='Both',
        method='update'
    ),
    dict(
        args=[
            {'visible': [True, False]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show only male trace
        label='Male',
        method='update'
    ),
    dict(
        args=[
            {'visible': [False, True]},
            {'yaxis': {'range': [0, 0.064]}}
        ],  # Show only female trace
        label='Female',
        method='update'
    )
]


# Create the updatemenus property
updatemenus = [
    dict(
        buttons=buttons,
        direction='down',
        pad={'r': 10, 't': 10},
        showactive=True,
        x=0.9,
        xanchor='left',
        y=1.2,
        yanchor='top'
    )
]

# Update the figure layout with updatemenus
fig.update_layout(updatemenus=updatemenus)

# Add annotation
fig.update_layout(
    annotations=[
        dict(
            text='',
            showarrow=False,
            x=0,
            y=1.085,
            yref='paper',
            align='left'
        )
    ]
)

# Make the Female trace visible as well so the 'Both' view shows initially
fig.update_traces(visible=True, selector=dict(name='Female'))

# Show the figure
fig.show()
# Graph 4

# Load the data from the CSV file
df = pd.read_csv('train_age_dataset.csv')

# Calculate the lower and upper bounds for outliers using Tukey's fences
Q1 = np.percentile(df['avgCompletion'], 25)
Q3 = np.percentile(df['avgCompletion'], 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers
df_filtered = df[(df['avgCompletion'] >= lower_bound) & (df['avgCompletion'] <= upper_bound)].copy()

filtered_age_groups = [1, 4]
df_filtered_subset = df_filtered[df_filtered['age_group'].isin(filtered_age_groups)]

# Map age-group codes to labels
age_labels = {
    1: '<18y',
    4: '>30y'
}
df_filtered_subset.loc[:, 'age_group'] = df_filtered_subset['age_group'].map(age_labels)

# Create the boxplot
fig = px.box(df_filtered_subset, x='age_group', y='avgCompletion', color='age_group',
             labels={'age_group': 'Age Group', 'avgCompletion': 'Average Completion'},
             title='Boxplot of Average Completion by Age Group (Outliers Removed)')

# Set the width and height of the figure
fig.update_layout(height=500)

# Show the boxplot
fig.show()
# Graph 5

data = pd.read_csv('./train_age_dataset.csv')

content_views_categories = pd.qcut(data['content_views'], q=3, labels=['Low', 'Medium', 'High'])
avgCompletion_categories = pd.qcut(data['avgCompletion'], q=3, labels=['Low', 'Medium', 'High'])
avgTimeSpent_categories = pd.qcut(data['avgTimeSpent'], q=3, labels=['Low', 'Medium', 'High'])
avgDuration_categories = pd.qcut(data['avgDuration'], q=3, labels=['Low', 'Medium', 'High'])

colors = {
    'Low': '#b0c4de',
    'Medium': '#3cb371',
    'High': '#e9967a'
}

fig = go.Figure(data=go.Parcats(
    dimensions=[
        {'label': 'Total Videos Watched', 'values': content_views_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Completion Rate', 'values': avgCompletion_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Average Duration of Videos user has watched', 'values': avgDuration_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']},
        {'label': 'Average Time Spent', 'values': avgTimeSpent_categories, 'categoryorder': 'array', 'categoryarray': ['High', 'Medium', 'Low']}
    ],

    line={
        'color': avgCompletion_categories.cat.codes,
        'colorscale': [[0, '#b0c4de'], [0.5, '#3cb371'], [1, '#e9967a']]
    }
))

fig.update_layout(title='Analysis of User engagement metrics', height = 500)
fig.show()
# Graph 6

# Read the data from the CSV file
data = pd.read_csv('train_age_dataset.csv')

# Define the tier labels
tier_labels = {
    1: '100,000+',
    2: '50,000 - 99,999',
    3: '20,000 - 49,999'
}

# Map the tier labels to the tier column
data['tier'] = data['tier'].map(tier_labels)
data['tier'] = pd.Categorical(data['tier'], categories=tier_labels.values(), ordered=True)

# Calculate the mean content_views per tier
mean_data = data.groupby('tier')['content_views'].mean().reset_index()

# Sort the data by the tier labels
mean_data = mean_data.sort_values('tier')

# Create bar trace for mean content_views
mean_trace = go.Bar(
    x=mean_data['tier'],
    y=mean_data['content_views'],
    name='Average Videos Watched',
    marker=dict(color='orange')
)

# Create the layout for mean content_views graph
mean_layout = go.Layout(
    title='Average daily videos watched per person',
    xaxis=dict(title='City population'),
    yaxis=dict(title='Daily videos watched'),
    height=400
)

# Create the figure for mean content_views graph
mean_fig = go.Figure(data=[mean_trace], layout=mean_layout)

# Show the mean content_views graph
mean_fig.show()

Reflection#

We got quite some feedback from the TA and our peers during the work session. We’ll go over each piece of feedback we received and reflect on it.

The main feedback we received was about the data story, and the relation between the plots and our data story. The TA told us that our data story wasn’t coherent. Instead of having one main story, we have several small stories, one for every plot. This meant we had to create one story and have the plots we created collectively substantiate this singular story. This also caused us to look for more meaning behind every plot. Instead of having a couple of individual plots with their own meaning, the plots had to represent one singular thing when they are combined.

Related to this was the feedback that it’s important to remember that the plots contribute to the data story, and not the other way around. The data story is the main attraction; the plots are arguments that help clarify it.

Other important feedback was that, for our plots containing information about the age groups, we should divide the values by the number of people in each age group. Our original plots only showed totals for each age group; dividing by group size gives information per person, which is more meaningful. For example, one age group in our dataset is users aged over 30, while the other age groups all span intervals of six years. Because of this, people over 30 have uploaded more videos on Trell in total, but simply because their age group is larger. When we plot the data per person, we can see that the over-30 group actually uploads the fewest videos per person of all age groups.
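The point of this feedback can be shown with a tiny made-up example (the numbers are hypothetical, not from the dataset): a larger group can dominate the totals while still having the lowest per-person value.

```python
import pandas as pd

# Hypothetical group sizes and total uploads per age group
df = pd.DataFrame({
    'age_group': ['<18y', '>30y'],
    'users':     [100,    400],
    'uploads':   [300,    600],
})

# Normalize by group size to get uploads per person
df['uploads_per_user'] = df['uploads'] / df['users']
# The >30y group uploads more in total (600 vs 300),
# but fewer videos per person (1.5 vs 3.0)
```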

Other feedback was that there can be factors involved in a plot that we don’t know about, so we cannot always draw a conclusion with 100% certainty. An example: females spend on average more time on Trell than men. Although it’s tempting to conclude that females therefore watch more videos than men, this doesn’t have to be the case. Men could still watch more videos than women, provided the average duration of the videos they watch is shorter.
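A small made-up calculation illustrates why total time spent does not determine the number of videos watched (all figures below are hypothetical):

```python
# Hypothetical figures (seconds): total time = videos watched * average duration
female_time, female_avg_duration = 3600, 60
male_time, male_avg_duration = 3000, 30

female_videos = female_time / female_avg_duration  # 3600 / 60 = 60 videos
male_videos = male_time / male_avg_duration        # 3000 / 30 = 100 videos
# Females spend more total time here, yet males watch more videos
```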

Work distribution#